Abstract

We introduce a simple, intuitive, yet powerful algorithm for clustering 
analysis. This algorithm stands from the viewpoint of elements to be clus- 
tered, and simulates the process of how they perform self-clustering. The 
algorithm is therefore named Self-Updating Process (SUP). We discover 
the algorithm's ability to simultaneously isolate noise while performing 
clustering, which enables the algorithm to produce good clustering results 
even when the level of noise in the data is high. We present simulation 
studies to demonstrate the performance of this algorithm. Applications 
to gene expression data and image segmentation are provided. 
1 Introduction

Clustering analysis is a useful technique to discover patterns in data. This tech- 
nique has been widely applied to many disciplines for partitioning data into 
several groups. Within each group, elements are considered to resemble each 
other. In image segmentation the clustering technique is applied to partition 
an image into regions, each of which has its own color patterns. In biology and 
medicine, the technique can be used to classify subjects on the basis of their 
clinical responses. The resulting grouping structure provides valuable informa- 
tion to discover subtypes of a disease. With the breakthrough in experimental 
molecular biology, clustering has rapidly received a considerable amount of at- 
tention and has become a powerful tool in exploring and identifying patterns in 
genome data. 
A vast number of clustering algorithms have been developed in the literature.
The model-based methods (Banfield and Raftery, 1993) make an assumption on
the probabilistic distribution of data, and the distance-based methods employ
the notion of "distance" that represents the similarity between two data points.
Among the distance-based methods, two major types are most commonly used.
The first type is hierarchical clustering (Hartigan, 1975), which partitions data
into groups through a series of agglomerative or divisive steps that operate on
the similarity measure between data points. The structure of data is revealed
through the process of hierarchical clustering and is presented by a tree diagram
known as a dendrogram. One weakness of hierarchical clustering is the irrevocable
clustering assignments: a mistake made at early steps can never be corrected
at later steps.

The second type of distance-based methods is known as partition clustering.
The clustering results are obtained as an optimal solution that either maximizes
or minimizes a criterion of some kind. The k-means algorithm (MacQueen, 1967;
Lloyd, 1982), which adopts the criterion of minimizing the sum of squared
distances from each data point to its closest cluster center, is one clustering
algorithm of this sort. Such algorithms, however, usually require an initial
partition to start the iterative process, and the number of clusters has to be
given a priori. In addition, this type of algorithm suffers from the problem of
being trapped in local minima (or maxima), which results from a poor selection
of the initial partition. There exist methods to improve the performance of the
k-means algorithm, including the estimation of the number of clusters (Milligan
and Cooper, 1985; Tibshirani et al., 2001) and the selection of initial values to
solve the local minima problem (Selim and Alsultan, 1991; Tseng and Wong, 2005).

Despite these weaknesses, the k-means algorithm is by far the most popular
clustering algorithm used in research and industrial applications. Its speed and
simplicity make it appealing. However, it does not guarantee better performance
than other clustering algorithms. In addition to the selection of an initial
partition that may result in poor performance, the criterion used by k-means to
define the "optimal" groupings is sometimes inappropriate. Section 4 shows an
example in which the structure of data does not conform to this criterion.

This brings us to the major challenge that every clustering problem encounters:
how to define an appropriate criterion for clustering? Is there a criterion that
makes sense for all data? We may answer this question by asking "what does
cluster mean?" The word cluster means (1) a group of the same kind, or (2)
a group of things close together. These explanations describe the nature of
clustering, and are indeed how human perception identifies clusters: elements
that are close to or resemble one another belong to the same group.

The formulation of a criterion, such as minimizing the sum of within-group
variations by k-means, is a conceptually and technically useful approach to the
problem of clustering. However, it does not pertain to the nature of clustering
as described above. From this point of view, we favor hierarchical clustering,
because it employs the idea of closeness and resemblance between elements.
One weakness of hierarchical clustering is that elements are merged too fast.
We will show in Section 4 that our algorithm slows down the merging process
and benefits from it.

In addition to defining a clustering criterion, another challenge for clustering
algorithms arises when the amount of noise in the data is substantial. Such
situations arise very often in the analysis of gene expression data, in which a
significant number of observed genes have distinct expression profiles and do
not co-regulate with other genes. The high level of noise from these so-called
"scatter noise genes" is likely to obscure or even distort a meaningful underlying
expression pattern. The identification of such genes is therefore crucial to
the performance of a clustering method. Recently several proposals have been
made for clustering data that contain a noise set. Some methods allow noise to
remain unclustered (Fraley and Raftery, 2002; Tseng and Wong, 2005). Some
incorporate functional annotation of genes into the analysis of microarray data
(Pan, 2006; Shen et al., 2010).



The self-updating process is a clustering algorithm that overcomes the
aforementioned problems and challenges for clustering. The development of this
algorithm was initiated as an extension to the generalized association plots (GAP)
(Chen, 2002), which was at first a clustering method and was later integrated
into a platform for exploratory data analysis (Wu et al., 2010). GAP utilizes
iteratively generated correlation matrices (McQuitty, 1968), according to which
data points are gathered towards the left and the right sides of an ellipse at
each iteration and eventually merge into two clusters. To extend GAP, we first
introduced a threshold on correlation to increase the number of clusters the
iterative matrices converge to. We later moved the iteration process from the
correlation space to the raw data space, which can display the actual movements
of data points. In the resulting clustering process, each data point continues
updating its own location until the whole system reaches a balance condition.
We therefore named this algorithm Self-Updating Process (SUP).

From the beginning of this work in 2006, we considered the idea of SUP original
and new. Only recently have we become aware that this idea has been introduced
into the literature as the blurring mean-shift algorithm (Fukunaga and
Hostetler, 1975; Cheng, 1995). In contrast to SUP, which was developed
exclusively for the problem of clustering, the original mean-shift algorithm
(Fukunaga and Hostetler, 1975) made its first appearance in kernel density
estimation, where it uses the sample mean within a local region to estimate the
gradient of a density function. The mean-shift algorithm was further extended
and analyzed by Cheng (1995), in which a generalized version called "blurring"
mean-shift is equivalent to SUP. Comaniciu and Meer (2002) applied the
mean-shift algorithm to the problem of image segmentation. The algorithm has
become better known in the computer science community since then.

While we independently came up with the same idea as the blurring mean-shift
algorithm, we also made great efforts to develop the updating system and to
study its properties. For example, we defined parameters to have specific
statistical meanings, in order to better reflect the intuition behind the idea.
We studied the influence of parameters on clustering results, and we provided a
guideline for parameter estimation. We made a thorough study of the performance
of this algorithm, and we conducted simulation experiments to demonstrate its
strengths. We also discovered a distinctive characteristic of this algorithm: it
has the ability to isolate noise while performing clustering. This
characteristic makes the algorithm particularly powerful for data that contains
a significant amount of noise. We present this characteristic by a simulation
example in Section 4.1 and by a gene expression data set in Section 5.1. All of
the aforementioned work, to our knowledge, has not yet been reported in the
literature.

In this paper we present a complete development of SUP. The paper is organized
as follows. Section 2 introduces the clustering algorithm SUP. In Section
3, we provide a mathematical proof that guarantees the convergence of SUP.
Section 4 presents simulations that demonstrate the performance of SUP and
shows comparisons with other methods. Illustrative examples are given in
Section 5, including a clustering analysis of gene expression data (Golub et al.,
1999) and an application to image segmentation. A discussion is presented in
Section 6.



2 Self-Updating Process 

2.1 The idea 

The central idea of self-updating process can be illustrated by the following 
example. 

Recall times when we were students and our teacher asked us to divide into 
groups to play a game. What did we do? We might walk directly towards 
our good friends. We might even ask some of those next to us whether they 
wanted to be in the same group while we walked. On the teacher's side, he/she 
would see that students are moving. Gradually and eventually, the groups were 
formed. 

Based on this childhood experience, we describe our algorithm as follows. 
Suppose there are N elements to be clustered, and there are p random variables 
representing elements' information. The data is an $N \times p$ matrix, which can be
viewed as N data points in a p-dimensional space. When the updating process 
begins, the movement of a data point is determined by its relationship with other 
data points. The relationship, for example, can be friendships and locations as 
in the previous example. We can quantify the relationships based on elements' 
information using measures such as the correlation and the Euclidean distance. 

2.2 The main algorithm 

The self-updating process is formulated as follows. 

(i) $x_1^{(0)}, x_2^{(0)}, \ldots, x_N^{(0)} \in \mathbb{R}^p$ are data points to be clustered.

(ii) At time $t + 1$, every point is updated to

\[
x_i^{(t+1)} = \sum_{j=1}^{N} \frac{f(x_i^{(t)}, x_j^{(t)})}{\sum_{k=1}^{N} f(x_i^{(t)}, x_k^{(t)})} \, x_j^{(t)}, \tag{1}
\]

where $f(x_i^{(t)}, x_j^{(t)})$ is some function that measures the influence between
$x_i^{(t)}$ and $x_j^{(t)}$.

(iii) Repeat (ii) until every point converges. 

When two data points are closer, the influence between them should be stronger.
We therefore assign a larger value to $f$ when $x_i^{(t)}$ and $x_j^{(t)}$ are closer, and
interpret $f(x_i^{(t)}, x_j^{(t)})$ as the mutual influence between point i and point j at
the t-th update. In plain words, equation (1) says that the next location where
point i moves to is determined by the influences it receives from all data points
at present, including point i itself. In statistical terminology, $x_i^{(t+1)}$ is the
weighted average of all $x_j^{(t)}$ for $j \in \{1, \ldots, N\}$.

Throughout this paper we use $f$ as an exponential decay function with respect
to some distance $d$:

\[
f(x_i^{(t)}, x_j^{(t)}) =
\begin{cases}
\exp[-d(x_i^{(t)}, x_j^{(t)})/T], & d(x_i^{(t)}, x_j^{(t)}) \le r, \\
0, & d(x_i^{(t)}, x_j^{(t)}) > r,
\end{cases} \tag{2}
\]

where we select $d(x_i^{(t)}, x_j^{(t)})$ as the Euclidean distance between the locations of
point i and point j at the t-th update. In the next subsection we use a simple
example to demonstrate how SUP operates. We also illustrate the roles of r and T
in the example. Other formulations of $f$ are discussed in the final section.
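The update (1) with the truncated exponential influence (2) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the stopping tolerance `tol`, the iteration cap `max_iter`, and the rule that groups points whose final locations lie within `eps` of each other are all hypothetical choices made for the example.

```python
import numpy as np


def truncated_exp_influence(dist, r, T):
    """Influence f of equation (2): exp(-d/T) when d <= r, and 0 otherwise."""
    return np.where(dist <= r, np.exp(-dist / T), 0.0)


def sup_step(x, r, T):
    """One update of equation (1): each point moves to a weighted average of all points."""
    # Pairwise Euclidean distances, shape (N, N).
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    weights = truncated_exp_influence(dist, r, T)
    # Every row sum is at least 1 because f(x_i, x_i) = exp(0) = 1.
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x


def sup(x0, r, T, tol=1e-8, max_iter=1000):
    """Repeat the update until no coordinate moves more than `tol` (assumed stopping rule)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = sup_step(x, r, T)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x


def labels_from_locations(x_final, eps=1e-4):
    """Give the same label to points whose final locations are within `eps` (assumed rule)."""
    labels = -np.ones(len(x_final), dtype=int)
    next_label = 0
    for i in range(len(x_final)):
        if labels[i] < 0:
            close = np.linalg.norm(x_final - x_final[i], axis=1) < eps
            labels[close & (labels < 0)] = next_label
            next_label += 1
    return labels
```

Calling `labels_from_locations(sup(x0, r, T))` on an N x p array `x0` returns one integer label per data point. The full N x N distance matrix is recomputed at every iteration, which keeps the sketch short but costs O(N^2 p) per update.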



Note that in the original mean-shift algorithm (Fukunaga and Hostetler,
1975), $f$ in (1) is defined as a flat kernel, which is an indicator function
representing whether the distance between two points is less than some threshold
value. Cheng (1995) further generalized $f$ to be any kernel function, while most
of the mean-shift applications used a truncated Gaussian as the kernel. In this
paper we use a truncated exponential function as written in (2), because
exponential decay is very often observed in nature. Moreover, in the later
discussion we show that the parameter T can be "dynamic": the parameter value
changes with iterations. That is to say, the updating process is no longer
homogeneous. This is one case not considered in the mean-shift algorithm.



2.3 A simple illustration 

Three data points were sampled from the bivariate normal distribution
$\mathrm{BVN}(\mu_k, I_2/25)$ for each $k \in \{1, \ldots, 9\}$, where the $\mu_k$ are (0,0), (2,0),
(1,1), (6,0), (8,0), (7,1), (3,3), (5,3) and (4,4). A total of 27 sampled points
were plotted in Figure 1(a). We first used the $f$ in (2) with r = 0.9 and T = 0.7.
Figures 1(b)-(d) illustrate how SUP updated the location of each data point. At
t = 3, the 27 data points converged to nine points without further movements
afterwards. These nine points represent the final locations of the resulting
nine clusters.



Figure 1: A graphical presentation of SUP with r=0.9 and T=0.7

Figure 2: A graphical presentation of SUP with r=3.5 and T=0.7



To illustrate the effect of r, we increased r from 0.9 to 3.5. Figures 2(a)-(f)
present the updating process and the final clustering result, in which data points
converged to three locations instead of nine. Indeed, when we take a look at the
original sampled data in Figure 1(a), it is subjective to conclude whether there
are three or nine groups. This is the role we assign r to play: it defines the
range of influence. In this example, a choice of r = 3.5 forced each data point
to be influenced only by those within 3.5 units. As a result, squares,
circles and triangles of the same colors were jointly influenced, and at the end
of the process converged to the same locations. Generally speaking, the use of
a small r value produces clusters of compact sizes, within which data points are
less heterogeneous. Without the use of r, or equivalently, when r is infinite,
Corollary 1 in the next section proves that all data points eventually converge
to a single cluster for any strictly positive function $f$.






Figure 3: A graphical presentation of SUP with r=3.5 and T=0.5 



To illustrate the effect of T, we changed T from 0.7 to 0.5, keeping r at 3.5.
Figures 3(a)-(h) present the updating process and the final clustering result. A
comparison between Figure 2 and Figure 3 shows that data points converged at
a slower rate when T was smaller. This can also be seen from (2): when T is small,
$f(u, u) = 1$ is much larger than $f(u, v)$ for every $v \neq u$, so data point u
hardly moves because the influence from itself dominates. When T is large, the
influences from other points are comparable to the self-influence, and the
movements of data points are correspondingly large. If we consider data points
as particles in a statistical mechanical system, parameter T then serves as the
"temperature". Data points move fast at a high temperature T, which accelerates
the convergence. That is to say, parameter T determines the rate of convergence
of SUP. Different T values may produce different clustering results because of
the different merging speeds.
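The dominance of the self-influence at low temperature can be checked numerically. The sketch below uses a hypothetical configuration, one point u with two neighbours exactly one unit away, and computes the normalized weights of equation (1) under the influence function (2); the function name `normalized_weights` and the chosen T values are assumptions for illustration only.

```python
import numpy as np


def normalized_weights(distances, r, T):
    """Weights of equation (1) for one point, given its distances to all points (itself included)."""
    d = np.asarray(distances, dtype=float)
    f = np.where(d <= r, np.exp(-d / T), 0.0)
    return f / f.sum()


# Hypothetical configuration: point u (distance 0 to itself) and two neighbours 1 unit away.
distances = [0.0, 1.0, 1.0]
for T in (0.1, 0.5, 0.7, 5.0):
    w = normalized_weights(distances, r=3.5, T=T)
    print(f"T = {T}: self-weight = {w[0]:.3f}")
```

The self-weight `w[0]` shrinks as T grows, so the point moves further towards its neighbours in a single update when the temperature is higher.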



2.4 Parameter Estimation 

In the self-updating process, there are two parameters to be determined. One is
r and the other is T. If there is a training set, cross-validation is a standard
method to estimate the parameters. However, in practice we rarely have this
additional information to learn the parameter values. In the following we present
simple data-driven methods for parameter estimation.
2.4.1 The influential range r



The selection of the influential range r can depend on the estimated probability 
density function of the pairwise distance. The valleys and peaks of the density 
function provide useful information about good and poor candidates of r. The 
reasoning is as follows. 

We begin with a simple situation when there are only two clusters in the data. 
Because of the two-cluster structure, the pairwise distances of pairs that contain 
one point from each cluster should not differ much, meaning that the estimated 
probability density function of the pairwise distance has a large probability 
mass in the range of these between-cluster distances. Similarly, the pairwise 
distances of pairs that contain both points from the same cluster should not 
differ much. There should also be a large probability mass in the range of 
these within-cluster distances. To select an influential range r that produces 
clusters retaining the original structure of two clusters, we should avoid these 
distances at peaks that are likely to be the between- or within-cluster distances. 
Otherwise, the updating process may easily distort the original structure of data 
and may consequently result in a poor clustering performance. 

The same reasoning applies to data of more than two clusters. Following this 
reasoning, a good candidate of r is either larger or smaller than the distances at 
peaks. A more desirable choice is a distance at valley with a small probability 
mass. This valley selection minimizes the chance that the updating process 
distorts the data structure, because the number of pairs that are influenced by 
the value of r is small compared to other choices of r values. 

We take the data presented in Figure 1(a) as an example. We use a frequency
polygon to approximate the probability density function of the pairwise distance.
Figure 4 presents the frequency polygon, in which valleys are at around 0.9, 2.5,
3.5, 5.1, 6.8 and 7.7. We showed in Section 2.3 that the use of r = 0.9 produced
nine clusters and that of r = 3.5 produced three clusters. For the rest of the
valleys, r = 2.5 produced a clustering result identical to that of r = 3.5, while
r = 5.1, 6.8 and 7.7 moved all data points into one single cluster.
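The valley search described above can be sketched as follows. The sketch approximates the frequency polygon by histogram counts at bin midpoints and reports local minima as candidate values of r. The number of bins and the rule for flat stretches (the first bin of a flat minimum is reported) are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distances of all N(N-1)/2 distinct pairs of rows of x."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(x), k=1)
    return dist[upper]


def frequency_polygon_valleys(dists, bins=30):
    """Return bin midpoints that are local minima of the histogram counts.

    The number of bins is an assumed tuning choice.
    """
    counts, edges = np.histogram(dists, bins=bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    valleys = [
        mids[i]
        for i in range(1, len(counts) - 1)
        if counts[i] < counts[i - 1] and counts[i] <= counts[i + 1]
    ]
    return np.array(valleys)
```

Each returned midpoint is a candidate for r; as in the example above, different valleys can lead to different numbers of clusters, so the candidates still have to be examined.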



Figure 4: The frequency polygon of the pairwise distances. 
2.4.2 The temperature T



As we learned in Section 2.3, the temperature T determines the convergence
rate of SUP, so one may favor a large T value to reduce the computation time.
This is, however, usually not a good choice: when data points move fast at a
high temperature, mistakes are often made.

To select a temperature T, we consider the following. Let data points j
and k be $r - \delta$ and $r + \delta$ units away from data point i, respectively, where
r is the selected influential range. When $\delta$ is small, i is about r units away
from both j and k. It is reasonable to assume that i receives approximately the
same amount of influence from j and k. According to (2), however, the actual
influences that i receives from j and k are $\exp[-(r - \delta)/T]$ and zero, respectively.
It therefore makes more sense to use a small T value such that $\exp(-r/T)$ is
close to zero. When T = r/5, $\exp(-r/T) = \exp(-5) \approx 0.0067$. Our experiments
showed that SUP with T = r/5 very often produced good clustering results within
a reasonable computing time.

An alternative to a static temperature of T = r/5 is a dynamic temperature
that increases with time. At the beginning of the updating process,
data points move at a lower temperature to slow down the merging. The
temperature is gradually increased with time to accelerate the updating process.
For the dynamic temperature, we propose to use $T = r(1/20 + t/50)$. The initial
temperature is only r/20. From t = 8 onward, the temperature exceeds r/5. In our
experiments, the dynamic temperature was usually able to handle a wider variety
of data.
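The two temperature choices are simple enough to write down directly. The sketch below evaluates the static rule T = r/5 and the dynamic rule T = r(1/20 + t/50), and finds the first iteration at which the dynamic temperature exceeds the static one. The function names and the example value r = 3.5 (borrowed from Section 2.3) are illustrative assumptions.

```python
import math


def static_temperature(r):
    """Static choice T = r/5."""
    return r / 5.0


def dynamic_temperature(r, t):
    """Dynamic choice T = r(1/20 + t/50) at iteration t = 0, 1, 2, ..."""
    return r * (1.0 / 20.0 + t / 50.0)


r = 3.5  # example influential range

# Influence exactly at the truncation boundary d = r under the static temperature: exp(-5).
boundary_influence = math.exp(-r / static_temperature(r))

# First iteration at which the dynamic temperature exceeds the static one.
first_exceeding = next(
    t for t in range(100) if dynamic_temperature(r, t) > static_temperature(r)
)
```

Solving r(1/20 + t/50) = r/5 gives t = 7.5, independent of r, which is why the crossing happens between the seventh and eighth updates.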



3 Convergence 

In this section we prove the convergence of the self-updating process. 

The convergence of the blurring mean-shift algorithm was studied by Cheng
(1995), which, however, included only two cases. The first case is when the
mutual influence between each pair of elements is nonzero. Theorem 3 in Cheng
(1995) showed that in this case all elements eventually converge to a single
cluster. The second case is under the assumption that elements can never move
arbitrarily close to each other. Theorem 4 in Cheng (1995) guaranteed that in
this second case the algorithm converges in finite steps.

Our result is more general than Theorem 3 and Theorem 4 in Cheng (1995),
where Theorem 3 is equivalent to our Corollary 1, which is an immediate
implication of our main result presented in Theorem 1. The convergence of the
self-updating process requires the function $f$ to be PDD (positive and decreasing
with respect to distance), which we define in the following.

Definition 1. The function $f$ in (1) is PDD (positive and decreasing with
respect to distance), if

(i) $0 \le f(u, v) \le 1$, and $f(u, v) = 1$ only when $u = v$.

(ii) $f(u, v)$ depends only on $\|u - v\|$, the distance from $u$ to $v$.

(iii) $f(u, v)$ is decreasing with respect to $\|u - v\|$.

Since $f(u, v)$ represents the influence that data points u and v receive from
each other, we assume a larger value of $f(u, v)$ when u and v are closer, as
stated in condition (iii). We also assume that the influence is solely determined
by the distance between u and v, meaning that $f(u_1, v_1) = f(u_2, v_2)$ whenever
$\|u_1 - v_1\| = \|u_2 - v_2\|$, as stated in condition (ii). In principle, $f(u, v)$ can
be negative, suggesting that u and v repel each other. However, when $f(u, v)$
is negative at some iteration and since $f$ is decreasing, u and v would move
further and further apart as the process keeps updating. The whole system will
consequently diverge and never reach a balance condition. This is the reason
we require $f(u, v)$ to be non-negative. We can define $f(u, u)$ to be any positive
number. For simplicity we normalize it to be one.

With a function $f$ that satisfies the PDD condition, the following theorem
guarantees the convergence of SUP.

Theorem 1. If the function $f$ in (1) is PDD, there exist $x_1^*, x_2^*, \ldots, x_N^*$
such that

\[ \lim_{t \to \infty} x_i^{(t)} = x_i^* \quad \forall i. \]

To prove Theorem 1 we introduce Lemma 1, Lemma 2 and Lemma 3.

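Lemma 1. Let $C_1^{(t)}$ be the convex hull of $\{x_1^{(t)}, \ldots, x_N^{(t)}\}$. Then

\[ C_1^{(0)} \supseteq C_1^{(1)} \supseteq \cdots \supseteq C_1^{(t)} \supseteq \cdots. \]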

Proof. The convex hull $C(X)$ of a set of points $X$ is the minimal convex set
containing $X$. Since

\[ x_i^{(t+1)} = \sum_{j=1}^{N} \frac{f(x_i^{(t)}, x_j^{(t)})}{\sum_{k=1}^{N} f(x_i^{(t)}, x_k^{(t)})} \, x_j^{(t)}, \]

$x_i^{(t+1)}$ is a weighted average of $x_j^{(t)}$ for $j = 1, \ldots, N$. Therefore,

\[ x_i^{(t+1)} \in C_1^{(t)}. \]

Since the above is true for each $i$, we have

\[ C_1^{(t)} \supseteq C(\{x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_N^{(t+1)}\}) = C_1^{(t+1)}. \]

□
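The nested-hull property of Lemma 1 is easy to observe numerically in one dimension, where the convex hull of the points is simply the interval [min, max]. The sketch below is an illustration, not part of the proof: the random seed, the number of points, and the values r = 2.0 and T = 0.4 are arbitrary assumptions, and a small tolerance absorbs floating-point rounding.

```python
import numpy as np


def sup_step_1d(x, r, T):
    """One update of equation (1) for points on the real line, using the influence (2)."""
    dist = np.abs(x[:, None] - x[None, :])
    w = np.where(dist <= r, np.exp(-dist / T), 0.0)
    return (w / w.sum(axis=1, keepdims=True)) @ x


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.uniform(0.0, 10.0, size=30)

# Record the hull [min, max] before each of 20 updates.
hulls = []
for _ in range(20):
    hulls.append((x.min(), x.max()))
    x = sup_step_1d(x, r=2.0, T=0.4)

# Each interval should contain the next one.
nested = all(
    lo_next >= lo - 1e-12 and hi_next <= hi + 1e-12
    for (lo, hi), (lo_next, hi_next) in zip(hulls, hulls[1:])
)
```

Because every updated point is a convex combination of the current points, `nested` holds for any non-negative influence function, not only for the truncated exponential used here.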

Note that the nested structure presented in Lemma 1 ensures the convergence
of the convex hulls $\{C_1^{(t)}\}$. Let $C_1$ be the limit of $C_1^{(t)}$:

\[ C_1 = \lim_{t \to \infty} C_1^{(t)} = \bigcap_{t=0}^{\infty} C_1^{(t)}. \]

On the other hand, since the convex hull of any finite set of points in $\mathbb{R}^p$ is a
polytope, each $C_1^{(t)}$ is a polytope. Each vertex of $C_1^{(t)}$ therefore must coincide
with at least one $x_i^{(t)}$ for some $i$, otherwise the polytope would have been
smaller. With the convergence of the convex hulls $\{C_1^{(t)}\}$, Lemma 2 claims that at
least some data points $\{x_i^{(t)}\}$ converge to the vertices of $C_1$. The proof is
included in Appendix A.1.

Lemma 2. If the function $f$ in (1) is PDD, for each vertex $v_{1a}$ of $C_1$, there
exists at least one $j$ such that

\[ \lim_{t \to \infty} x_j^{(t)} = v_{1a}. \tag{3} \]

Having shown that at least some points converge under SUP, hereafter we
consider the rest of the data points. Let $\Omega_1$ be the set of points shown to
converge to the vertices of $C_1$. Define $C_2^{(t)}$ to be the convex hull of
$\{x_i^{(t)} : i \notin \Omega_1\}$. Note that $\{C_2^{(t)}\}$ may not be nested at early stages of
the iterations: points not in $\Omega_1$ may move outside the current convex hull $C_2^{(t)}$
due to the influence from $\Omega_1$, so the volume of the convex hull may increase
from one iteration to the next. The nested property, however, holds after some
iteration when all data points in $\Omega_1$ converge. Explicitly,

\[ C_2^{(t)} \supseteq C_2^{(t+1)} \quad \text{after some } t, \]

which also implies the convergence of $\{C_2^{(t)}\}$,

\[ C_2 = \lim_{t \to \infty} C_2^{(t)}. \]

We introduce the following Lemma 3, which leads to the nested property of
$\{C_2^{(t)}\}$. It states that when all data points in $\Omega_1$ converge, points in $\Omega_1$
receive no influence from points not in $\Omega_1$, otherwise they would have been
attracted inwards. That is to say, data points not in $\Omega_1$ also no longer receive
influence from points in $\Omega_1$, meaning that the influence from points in $\Omega_1$ goes
down to zero. The proof of Lemma 3 is included in Appendix A.2.

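Lemma 3. For any $x_i$ such that $i \in \Omega_1$,

\[ \lim_{t \to \infty} f(x_i^{(t)}, x_j^{(t)}) = 0, \]

for all $j$ such that

\[ \lim_{t \to \infty} x_j^{(t)} \neq \lim_{t \to \infty} x_i^{(t)}. \]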

From the above, we can claim a result for $C_2$ similar to that of Lemma 2 for
$C_1$: each vertex of $C_2$ has at least one data point converging to it. The same
argument can be applied again and again to $C_3, C_4, \ldots$, until all data points
converge. This completes the proof of Theorem 1.

Although Theorem 1 guarantees the convergence of SUP when $f$ satisfies the PDD
condition, some choices of $f$ produce trivial clustering results, in which
all data points are clustered into one single group. We identify such choices
in the following corollary.
Corollary 1. Let $r_M$ be the maximum pairwise distance between any two data
points. If $f$ is PDD with $f(r_M) > 0$, there exists $c$ such that

\[ \lim_{t \to \infty} x_i^{(t)} = c \quad \forall i. \]

For any two points i and j, Lemma 1 implies that $\|x_i^{(t)} - x_j^{(t)}\| \le r_M$ for
every $t$. Since $f$ is decreasing with respect to distance, the influence between i
and j is always at least $f(r_M)$. If $f(r_M) > 0$, then $f(x_i^{(t)}, x_j^{(t)}) \ge f(r_M) > 0$
for every i and j. Lemma 3 shows, however, that the influence between any
two points which do not converge to the same position tends to zero. Thus,
$f(x_i^{(t)}, x_j^{(t)}) \ge f(r_M) > 0$ for every i and j implies that all data points
converge to the same position, as stated in Corollary 1.

For the purpose of clustering, it is not desirable to have all data points
converge to the same position. To prevent trivial clustering results, $f$ has to
be zero on $(r, \infty)$ for some $r < r_M$.



4 Simulations and Comparison 

4.1 Data with noise 

In this simulation we consider data that contains noise, which in practice is very
often present, for example, in gene expression data and in image data. Many
clustering algorithms fail to produce reasonable results for data with
noise, because scattered points of noise very often obscure the original structure
of the data and therefore make it difficult for algorithms to discover patterns.

We demonstrate the performance of SUP in comparison with the k-means
algorithm, considering two approaches to the problem of local minima that
may result in a poor performance of k-means. The first approach was the use
of multiple initial values. We allowed k-means to start with multiple sets of
k randomly selected initial centers. Then the clustering result from the set of
initial centers that had the minimum sum of within-group variations was
selected. The use of multiple initial values increases the chance for k-means
to achieve its optimal performance. The second approach we considered was the
selection of initial values by Tseng and Wong (2005), in which hierarchical
clustering was first applied to obtain results of $k \times p$ clusters for
some pre-determined number of clusters k and some integer value p. Then the
centers of the k largest clusters were taken as the initial centers for k-means.
Tseng and Wong (2005) showed that this method for selecting initial centers
can achieve good clustering results.

We used the example also presented by Tseng and Wong (2005), in which
data of three clusters and a number of scattered points were generated as follows.
Three clusters were sampled from standard normal distributions centered at
(-6, 0), (6, 0) and (0, 6), respectively. Each point in a cluster was restricted to
lie within two standard deviations of its center. The scattered points representing
noise were sampled uniformly from [-12, 12] x [-6, 12], but not within three
standard deviations of any of the three centers. We used different numbers
of scattered points to represent data sets with varying degrees of noise: each
simulated data set has 50 points in each of the three clusters and n scattered
points, where n is 10, 50, 100 or 200. Figure 5 displays one example data set,
showing 50 points in each of the three clusters denoted by circles, x-marks and
pluses, and 50 scattered points of noise by dots.
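The data-generating scheme just described can be sketched with simple rejection sampling. The sketch makes two assumptions that the text does not spell out: "within two (or three) standard deviations" is read as Euclidean distance at most 2 (or 3) from the center, and rejected draws are simply resampled. The function names and the label convention (0, 1, 2 for clusters, -1 for noise) are hypothetical.

```python
import numpy as np


def sample_cluster(center, n, rng):
    """Standard normal points around `center`, kept only if within distance 2 of it."""
    center = np.asarray(center, dtype=float)
    points = []
    while len(points) < n:
        p = center + rng.standard_normal(2)
        if np.linalg.norm(p - center) <= 2.0:
            points.append(p)
    return np.array(points).reshape(-1, 2)


def sample_noise(centers, n, rng):
    """Uniform points on [-12, 12] x [-6, 12], rejected if within distance 3 of any center."""
    centers = np.asarray(centers, dtype=float)
    points = []
    while len(points) < n:
        p = np.array([rng.uniform(-12.0, 12.0), rng.uniform(-6.0, 12.0)])
        if np.all(np.linalg.norm(centers - p, axis=1) > 3.0):
            points.append(p)
    return np.array(points).reshape(-1, 2)


def simulate(n_noise, seed=0):
    """Return (data, labels) with 50 points per cluster and `n_noise` scattered points."""
    rng = np.random.default_rng(seed)
    centers = [(-6.0, 0.0), (6.0, 0.0), (0.0, 6.0)]
    clusters = [sample_cluster(c, 50, rng) for c in centers]
    noise = sample_noise(centers, n_noise, rng)
    data = np.vstack(clusters + [noise])
    # Labels 0, 1, 2 mark the clusters; -1 marks scattered noise.
    labels = np.repeat([0, 1, 2, -1], [50, 50, 50, n_noise])
    return data, labels
```

For example, `simulate(50)` returns a 200 x 2 array comparable in layout to the example data set with 50 noise points.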




Figure 5: Three groups (circles, x-marks and pluses) and noises (dots) 




Figure 6: The frequency polygon of the pairwise distances with one valley at 
around 3.0618 

We simulated 100,000 runs for each data size, and compared clustering results
from the k-means algorithm with those from SUP. For the k-means algorithm,
results were obtained using one set of random initial centers, 100 sets of random
initial centers, and the initial centers of Tseng and Wong (2005) using single
and complete linkages with p = 1, 3 and 6. For SUP, we present results from both
static and dynamic temperatures with $T = r/5$ and $T = r(1/20 + t/50)$,
respectively, where the value of r was selected automatically for each simulated
data set according to the frequency polygon of the pairwise distances. Figure 6
graphically shows the selection of r for the example data presented in Figure 5.
The frequency polygon suggests the use of r = 3.0618.

Table 1: The numbers of incorrect results in 100,000 runs of simulations

Number of noise                                10      50     100     200
K-means   random initial                     9774    1177    1150    1404
          100 sets of random initials                                 709
          Single Linkage     p=1             6771    1471    1143    1153
                             p=3               43    3085    2211    1114
                             p=6                      551     925     963
          Complete Linkage   p=1             1849    1380     952    1251
                             p=3                9                     444
                             p=6             4597       1       1     337
SUP       static T                                                     16
          dynamic T                                                   205

Table 2: CPU time per run (seconds)

Number of noise                                10        50       100       200
K-means   random initial                   0.0026    0.0031    0.0036    0.0037
          100 sets of random initials      0.1487    0.1712    0.2233    0.2463
SUP       static T                         0.0197    0.0558    0.0922    0.2481
          dynamic T                        0.0255    0.0532    0.0811    0.1988

Since the objective is to correctly cluster the 150 non-noise points, we
compared SUP and the k-means algorithm on the basis of the number of runs that
produced correct clustering results: a run was taken as "correct" only when all
of the 150 non-noise points were clustered correctly. Table 1 presents the
number of runs that were not correct for data containing different levels of noise.
This table demonstrates the clustering performance of SUP, showing that SUP
made considerably fewer mistakes than the k-means algorithm. We also computed
the running time in seconds for each run of the simulation. Table 2 focuses
on the comparison between SUP and the k-means algorithm with multiple initial
sets, which also produced reasonable clustering performance as shown in Table
1. This comparison suggests that SUP is competitive in computational efficiency.

In addition to the 150 non-noise points, we examined the clustering results for
the scattered noise data points. We take the example presented in Figure 5 to
illustrate the results. When r = 3.0618 was used, SUP produced 12 clusters.
Three, four and five noise data points were grouped into the three clusters of
50 non-noise points, respectively. The remaining 38 noise data points
constituted the other nine clusters. When r decreased to 2.1, SUP produced 26
clusters, including three clusters of 50 non-noise data points and 23 clusters
constituted exclusively by the 50 scattered noise data points. When r further
decreased to 1.5, SUP produced 34 clusters. The three clusters of 50 non-noise
data points remained, and the number of clusters constituted by scattered noise
data points increased to 31.

To summarize, this simulation example shows that SUP outperforms the k-means
algorithm in clustering data with noise. In addition, SUP has the ability to
separate the noise data points from the non-noise data points. This ability is
further demonstrated by the heat maps presented in Figure 10, which are based
on gene expression data.



4.2 Crowded Data 

When groups in the data are widely separated from each other, most clustering
algorithms can produce good results. In this simulation example we focus on the
type of data in which groups lie close to each other. Fifty points were sampled
for each of the three groups from bivariate normal distributions centered at
(-4, 5), (-5, -1) and (0, 1) with $(\sigma_x, \sigma_y, \rho)$ as (5, 2, 0),
(2, 2, 0) and (3, 3, 0), respectively. Each point was restricted to lie within
one standard deviation from its center. The data points in the first group were
further rotated counter-clockwise with respect to its center (-4, 5) to create
the structure of an inclined ellipse. Figure 7(a) displays an example of the
simulated data, in which points from the three groups are colored in navy, red
and green, respectively.




Figure 7: (a) The example data, (b) The selection of r value, (c)-(j) Clustering 
results by SUP using different r values, where (c)-(f) are results from static 
temperature and (g)-(j) are from dynamic temperature. Data points that were 
clustered into the same group are displayed by the same color. 
Since groups in this example were closely located, it is sometimes difficult
to find a local minimum of pairwise distances for this type of data. Figure 7(b)
shows the frequency polygon of the pairwise distances for the example data
presented in Figure 7(a). There was an unclear valley at around 3.051. This r
value produced four clusters, presented in Figures 7(c) and 7(g), which are
results from static and dynamic temperatures, respectively. To enlarge the
cluster sizes, we increased the value of r by experimenting with various
percentiles of the observed pairwise distances. Figures 7(d)-(f) and Figures
7(h)-(j) present results from taking r as the 35th, 40th, and 45th percentiles.
Figures 7(d)-(f) show results from static temperature and Figures 7(h)-(j) show
results from dynamic temperature.
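
Taking r as a percentile of the observed pairwise distances can be sketched as follows. The helper name and the use of Python's `statistics.quantiles` with its "inclusive" interpolation are assumptions of this sketch; the text does not state which percentile definition was used.

```python
import math
import statistics


def r_from_percentile(points, percentile):
    """Return the given percentile (1 to 99) of all pairwise Euclidean distances."""
    distances = [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    # quantiles(..., n=100) returns the 99 cut points between the percentiles.
    cuts = statistics.quantiles(distances, n=100, method="inclusive")
    return cuts[percentile - 1]


# Hypothetical one-dimensional data with pairwise distances 1, 2 and 3.
toy = [(0,), (1,), (3,)]
r35, r40, r45 = (r_from_percentile(toy, q) for q in (35, 40, 45))
```

A larger percentile gives a larger r, so each point is influenced by a larger share of the data, which is why the 45th percentile in the example above merges everything into one cluster.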

The use of dynamic temperature noticeably outperforms that of static
temperature in this simulation example. The detailed description is as
follows. Although Figure 7(c) shows that a correct three-cluster result can
be obtained by merging the navies and the greens, Figures 7(d)-(f) show that
the use of static temperature with other values of r either made one mistake
or produced clustering results that displayed a data structure different from
that of the original data. Figures 7(h) and 7(i), on the other hand, show that
SUP with dynamic temperature using r as the 35th and 40th percentiles both
produced perfect clustering results. When r increased to the 45th percentile,
meaning that every point was influenced by almost half of the points in the
data, the updating process assigned all data points to a single cluster, as
shown in Figure 7(j).

When groups in the data lie close together as in this simulation example, data
points should be updated much more slowly, especially at early iterations of
the process. This is the case when we use the dynamic temperature
T = r(1/20 + t/50), where the initial temperature is only T = r/20, a
considerably lower temperature than the static temperature T = r/5. From this
point of view, the use of dynamic temperature is more appropriate for
clustering crowded data.
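
To make the difference between the two temperature schedules concrete, the following Python sketch runs the updating process under either one. It assumes the truncated exponential-decay influence function, taken here as f = exp(-d/T) when the distance d is at most r and 0 otherwise, and it reads t in the dynamic temperature as the iteration counter starting at 0. The function names, the stopping tolerance and the toy data are illustrative choices, not the authors' code.

```python
import math


def influence(p, q, r, T):
    """Assumed influence function: exp(-d/T) within range r, otherwise 0."""
    d = math.dist(p, q)
    return math.exp(-d / T) if d <= r else 0.0


def sup_step(points, r, T):
    """Move every point to the influence-weighted average of all points."""
    updated = []
    for p in points:
        weights = [influence(p, q, r, T) for q in points]
        total = sum(weights)  # at least 1, because a point influences itself
        updated.append(tuple(
            sum(w * q[k] for w, q in zip(weights, points)) / total
            for k in range(len(p))
        ))
    return updated


def sup(points, r, dynamic=True, max_iter=500, tol=1e-9):
    """Iterate until no point moves more than tol; return the final positions."""
    points = [tuple(float(c) for c in p) for p in points]
    for t in range(max_iter):
        # Dynamic: T = r(1/20 + t/50); static: T = r/5.
        T = r * (1 / 20 + t / 50) if dynamic else r / 5
        updated = sup_step(points, r, T)
        shift = max(math.dist(p, q) for p, q in zip(points, updated))
        points = updated
        if shift < tol:
            break
    return points


# Hypothetical toy data: two small groups that are farther apart than r.
data = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5)]
final_static = sup(data, r=2.0, dynamic=False)
final_dynamic = sup(data, r=2.0, dynamic=True)
```

Points that end at the same position form one cluster. Under the dynamic schedule the first few updates are small because the initial temperature r/20 gives distant neighbors very little weight, which is the slow start the paragraph above argues for.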



Figure 8: Clustering results by the k-means algorithm. (a) is the result using
the true centers as initials, and (b)-(f) are results from random initials.

We also applied the k-means algorithm and hierarchical clustering to the same
data presented in Figure 7(a). Figures 8(a)-(f) show results from the k-means
algorithm, where Figure 8(a) was the result using the true centers (-4, 5),
(-5, -1) and (0, 1) as initials, and Figures 8(b)-(f) were results from random
initials with varying k, where k is the pre-determined number of groups. These
figures show that k-means was unable to capture the original structure of the
data, even when we used the true centers as the initial values.

Because k-means is designed to minimize the total distance between each data
point and its cluster center, it may mistakenly assign data points that belong
to a larger cluster to a nearby smaller one, as these points are in fact closer
to the center of the smaller cluster. This explains what we see in Figures
8(a)-(f): k-means is likely to fail when the data has clusters of distinct
shapes and sizes, as in this simulation example.




Figure 9: Hierarchical clustering results. (a)-(c) are results by single
linkage, and (d)-(f) are results by complete linkage.

The results by hierarchical clustering using single and complete linkage are
presented in Figures 9(a)-(c) and Figures 9(d)-(f), respectively. The results
by single linkage agree more closely with the clustering structure of the
original data. However, as it is often difficult to find a good threshold for
cutting the dendrogram tree into a clean and meaningful clustering structure,
some subsequent merging and rotation of the sub-trees are often necessary.



5 Application 
5.1 Golub Data 

We use the gene expression data presented in Golub et al. (1999) to demonstrate
the clustering performance of SUP and its ability to separate noise. Three
pre-processing steps were applied to the Golub data, which originally had
expression values of 7129 genes from 38 patients. Normalizations of the
expression values within arrays were also applied. The data we obtained was
from the package "multtest" (version 2.8.0) of Bioconductor. This data contains
pre-processed and normalized expression values of 3051 genes from 38 patients,
among which 27 were with acute lymphoblastic leukemia (ALL) and 11 were with
acute myeloid leukemia (AML). The following illustration of SUP includes:
(i) discovering gene patterns that are mostly associated with the ALL-AML
distinction, and (ii) classifying the 38 patients in this sample data using the
50 genes selected by Golub et al. (1999).



5.1.1 Discover gene patterns 

Recall that SUP updates each data point's location according to the function
f that measures the influence between every pair of points. The more similar
two points are, the more influence they receive from each other. From this
point of view, resulting clusters that have only one or two data points are
considered isolated, showing no resemblance to the rest of the data. We often
call these isolated data points "noise". In the context of gene expression
data, we consider these isolated data points as scattered genes.

To perform SUP on genes, we first normalized the expression values by
genes to ensure equal weight of each gene. Then we calculated the frequency
polygon of the pairwise distances between genes to determine the influential
range r. The frequency polygon showed no clear valley. We then found that the
use of r = 5 produced only one large cluster and many tiny clusters that were
considered as noise. Figure 10 presents clustering results by SUP using dynamic
temperature with selected r values smaller than 5. Smaller r values clearly
produced tighter clusters that exhibited lower within-cluster heterogeneity.

We summarize the clustering results by taking r = 4.6 as an example. The 
use of r = 4.6 produced 1478 clusters in total, among which there were only nine 
clusters of sizes larger than ten, and 1420 clusters that contained only one single 
gene. The sizes of the five largest clusters were 580, 349, 276, 176 and 38 genes, 
respectively. The largest and the fourth largest clusters, as shown in Figure
10(c), corresponded to genes that had expression values above the mean (colored
in red) for most of the ALL patients and below the mean (colored in blue) for 
most of the AML patients. The distinction between the first and the fourth 
largest clusters was the detailed expression patterns shown particularly in ALL 
patients. The second and the third largest clusters, in contrast, corresponded 
to genes that had high expression values (colored in red) for most of the AML 
patients. The distinction between the second and the third largest clusters was 
also the expression patterns shown in ALL patients: the third largest cluster 
exhibited uniformly low gene expression values, while the genes in the second 
largest cluster had both high and low values. 

To validate the performance of SUP, we compared our clustering results to the
50 genes identified by Golub et al. (1999) that were most highly correlated
with the ALL-AML distinction. We located the 50 genes in the clusters produced
by r = 4.6, finding that 25 of them were included in the largest cluster, 24
were in the third largest cluster and one gene was in the second largest
cluster. When r = 4.7 was used, all of the 50 genes were found in the two
largest clusters, with 25 genes in each of the two. Figure 10(d) shows that
these two largest clusters corresponded to genes with high values in AML
patients and genes with low values in AML patients, respectively.

In this application, we also demonstrate the strength of SUP in isolating 
Figure 10: The clustering results by SUP with dynamic temperature and various
values of r. The side-bar next to each heat map indicates patients' cancer
types, where light and dark gray colors represent ALL and AML patients,
respectively. The heat maps display normalized gene expression values of 3051
genes from 38 patients, with each row representing a patient's gene expression
profile.







Figure 11: The clustering results by the k-means algorithm with different
values of k.



noise by a comparison to the k-means algorithm. Figure 11 presents clustering
results by the k-means algorithm with k = 20 and k = 50. The clusters produced
by k-means were of similar sizes, and the 50 significant genes identified by
Golub et al. (1999) were dispersed over nine and twelve clusters when k = 20
and k = 50, respectively. As a result, meaningful gene expression patterns were
difficult to obtain by the k-means algorithm, unless we had a way to merge and
clean the clustering results.

While Figure 11 shows that the k-means algorithm is incapable of separating
scattered genes, this lack of ability to separate noise is in fact a common
problem of most clustering algorithms, including hierarchical clustering.



5.1.2 Classify patients 

We performed SUP on the 38 patients in this data, using expression values of
the 50 genes most correlated with the ALL-AML distinction identified by
Golub et al. (1999). Since the 50 genes were selected using exactly the same
data from the 38 patients we want to classify, this clustering analysis
has essentially no practical value. The purpose of this analysis, however, is
simply to compare the clustering performance of SUP to that of k-means and
hierarchical clustering, taking this gene expression data of size 38 x 50 as an
illustrative example.

The pairwise distances between patients were calculated. Figure 12 shows
the frequency polygon, in which a clear valley was observed at around 9.8982. 
Using this r value, SUP with dynamic temperature produced two clusters of 
sizes 27 and 11. These two clusters identically corresponded to the two groups 
of patients, ALL and AML, meaning that SUP made 100% accurate distinction 
between the two types of leukemia. We also applied k-means algorithm and 
hierarchical clustering to this data. Results showed that k-means algorithm 
with k = 2 misclassified one patient, and hierarchical clustering with different 
linkages misclassified at least two patients. 




Figure 12: The frequency polygon of pairwise distances between patients 



5.2 Image Segmentation 

When pixels in an image are considered as elements to be clustered, the prob- 
lem of image segmentation becomes a clustering problem. In this example we 
demonstrate the performance of SUP on the problem of image segmentation. 
We selected six test images from "The Berkeley Segmentation Dataset" . Figure 
13 displays the test images. Each image is of size 240 x 160.

The first question we encountered was: what information should be included
in the data to best convey the important characteristics of an image for the
purpose of segmentation? We consider two types of information. One is the
position, and the other is the color intensity. The information on position is 
represented by two variables: the x and y coordinates of a pixel. The information 
on color intensity can be obtained through the components of a color model. 
Among several choices of the representation for color intensity, the YUV model 
that defines a color space in terms of one luma component (Y) for brightness and 
two chrominance components (U and V) for color corresponds more closely to 
the human vision perception of colors than the standard RGB color model. Since 
the performance of a segmentation result is mostly evaluated by human eyes, we 
consider the YUV information more appropriate for image segmentation. The 
data for each image is then represented by a matrix of dimension 38400 x 5:
there are 38400 pixels in each of the test images, and each pixel is described
by the five aforementioned variables: x, y, Y, U and V.

The next question was: how to balance between the two types of information, 
position and color? When the information on position is given more weight in 
the process of segmentation, adjacent areas with distinct colors are more likely 
to be combined as one region. The opposite happens when
the information on color is given more weight. The weighting coefficient for the
two types of information should therefore depend on the desired sizes of the 
resulting segmented regions. These desired sizes should reflect the true sizes of 
the shapes of the major objects in the image to be segmented. That is to say, 
the weighting coefficient should vary for each individual image. We introduced 
a scaling parameter a that serves as the weighting coefficient. This parameter 
is used to re-scale the x and y coordinates to x/a and y/a. 
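
As a concrete illustration of how the five-variable data could be assembled, the sketch below builds one row (x/a, y/a, Y, U, V) per pixel. The conversion coefficients are those of the common analog (BT.601) YUV definition, which is an assumption since the text does not say which YUV variant was used. The function names and the nested-list image format are also illustrative.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB triple to YUV (BT.601 luma weights; an assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


def pixel_features(image, a):
    """Return one row [x/a, y/a, Y, U, V] per pixel.

    image is a list of rows, each row a list of (R, G, B) triples;
    a is the scaling parameter that weights position against color.
    """
    rows = []
    for y_pos, row in enumerate(image):
        for x_pos, (r, g, b) in enumerate(row):
            rows.append([x_pos / a, y_pos / a, *rgb_to_yuv(r, g, b)])
    return rows


# Hypothetical 2 x 3 image: two black columns and one white column.
image = [[(0, 0, 0), (0, 0, 0), (255, 255, 255)],
         [(0, 0, 0), (0, 0, 0), (255, 255, 255)]]
features = pixel_features(image, a=8)
```

A larger a shrinks the two position columns relative to the three color columns, so color differences dominate the distances between pixels; a smaller a does the opposite.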




Figure 13: Test images from "The Berkeley Segmentation Dataset" 

We applied SUP with dynamic temperature on the constructed data of five
variables to cluster pixels. In the function f presented in (2), we replaced
the Euclidean distance $L_2$ with the absolute distance $L_1$ to reduce the
effect of possible large deviations in one dimension only. For each test image,
we experimented with various sets of parameter values for r and a, with r
ranging from 60 to 120 and a from 8 to 20. Figure 14 presents the best
segmentation results for each image, showing good segmentations obtained by SUP
alone. A combined approach of SUP with other segmentation techniques is
therefore expected to be very promising for segmenting images.
Figure 14: Image segmentation results of the test images in Figure 13 by SUP

6 Discussion and Conclusion 

The self-updating process is a simple, intuitive and powerful clustering algo- 
rithm. In the updating process, each element moves to a new position at each 
iteration. The new position depends on the influences the element receives from 
other elements. At the end of the process, every element reaches its equilibrium 
position without further movements. Elements that arrive at the same position
are considered to belong to the same cluster. The algorithm is straightforward,
and the convergence of the updating process is also proved.

Although SUP was not specifically developed to handle noise in the data, this
ability comes naturally as a byproduct. Recall that elements in the data are
clustered on the basis of mutual influence, which is defined to be larger when
two elements are closer. As a result, noise data points that are not close to
other data points are bound to be isolated at the end of the updating process.
The noise data points can therefore be easily identified. In Section 4.1 and
Section 5.1 we showed the strength of SUP in separating noise. This strength
offers a great advantage to the use of SUP, especially when the noise level in
the data is substantially high.

Unlike the k-means algorithm, which minimizes the sum of the within-cluster
variations, SUP is not a clustering algorithm that optimizes a certain
criterion function. Although it is often appealing to have clustering results
that represent specific statistical terms, such as the solution of the k-means
algorithm nicely representing "the minimizer of the sum of the within-cluster
variations", there are times when these terms are not what we truly seek in the
data. Section 4.2 presented one example, in which the criterion used by the
k-means algorithm does not conform to the structure of the data. In such a
situation, the minimizer of the sum of the within-cluster variations cannot
reveal the true data structure. Poor clustering results by the k-means
algorithm presented in Section 4.2 verified this point: the use of criterion
functions for clustering is sometimes inappropriate.

The self-updating process is in some sense a slow version of agglomerative
hierarchical clustering: in SUP, two elements are merged gradually instead of at
once. As one weakness of hierarchical clustering is that early mistakes cannot
be corrected, slowing down the merging process, especially at the beginning
stage, can very often reduce the chances of making mistakes. Later we realized
that the connection between agglomerative hierarchical clustering and SUP is
even closer: when the function f in (1) is generalized to be non-homogeneous in
t, agglomerative hierarchical clustering with centroid linkage can be written
as a special case of SUP. We write the non-homogeneous function $f_t$ as

\[
f_t\bigl(x_i^{(t)}, x_j^{(t)}\bigr) =
\begin{cases}
1, & d\bigl(x_i^{(t)}, x_j^{(t)}\bigr) \le r^{(t)}, \\
0, & \text{otherwise},
\end{cases}
\tag{4}
\]

where the influential range changes at each iteration,

\[
r^{(t)} = \min_{k \ne l \text{ and } x_k^{(t)} \ne x_l^{(t)}} \bigl\| x_k^{(t)} - x_l^{(t)} \bigr\|.
\]

This function $f_t$ takes a positive value only when $i = j$, or when
$x_i^{(t)} \ne x_j^{(t)}$ and the distance between $x_i^{(t)}$ and $x_j^{(t)}$
is the smallest among all non-zero pairwise distances. Using this $f_t$, SUP at
the first iteration only updates the pair that has the smallest pairwise
distance. Both elements of the pair are updated to the averaged position of the
pair according to (1). At later iterations SUP only updates the two groups that
have the smallest between-group distance. Each element in the two groups is
updated to the averaged position of all elements in the two groups. That is to
say, SUP using the function f in (4) creates a merging process identical to
agglomerative hierarchical clustering using centroid linkage.
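
The correspondence with centroid linkage can be checked numerically on a small example. The sketch below applies the update with the non-homogeneous function in (4), taking the value of $f_t$ inside the range $r^{(t)}$ to be 1 so that each update is a plain average. The function name and the one-dimensional toy data are hypothetical, and ties in the smallest distance are not treated specially.

```python
import math


def sup_centroid_step(points):
    """One SUP update with f_t = 1 if d <= r_t and 0 otherwise.

    r_t is the smallest non-zero pairwise distance at the current iteration.
    """
    dist = [[math.dist(p, q) for q in points] for p in points]
    nonzero = [d for row in dist for d in row if d > 0]
    if not nonzero:  # all points already coincide
        return list(points)
    r_t = min(nonzero)
    updated = []
    for i in range(len(points)):
        # Points within range r_t of point i, including point i itself.
        group = [q for q, d in zip(points, dist[i]) if d <= r_t]
        updated.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return updated


# Hypothetical one-dimensional data.
step1 = sup_centroid_step([(0.0,), (1.0,), (3.0,), (10.0,)])
step2 = sup_centroid_step(step1)
step3 = sup_centroid_step(step2)
```

After the first step the closest pair {0, 1} sits at its centroid 0.5. The second step merges that group with the point 3 at the centroid 4/3 of the three original points, and the third step merges everything at 3.5, which is the order and position of merges that centroid linkage produces on these data.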

In addition to the exponential decay function in (2) that is used throughout
this paper, and the function in (4) that generates a process identical to
agglomerative hierarchical clustering using centroid linkage, SUP can turn into
other clustering processes by the use of various types of f functions. When
there is prior information about the data structure, it is often desirable to
include the prior information in the clustering process. This can be done by
incorporating such information into the function f, so that SUP turns into a
model-based algorithm. We are interested in the formation of f functions for
data with certain structures, such as spiral data and data with specific
probabilistic distributions. By incorporating information that characterizes
some known data structure, the performance of SUP can be further enhanced.



A Proof of Lemmas 

A.1 Proof of Lemma 2

Since
\[
C_1 = \lim_{t \to \infty} C_1^{(t)},
\]
for each $i$, there exists a sequence of $v_{1,i}^{(t)}$'s (exchange vertex
indices if necessary), such that
\[
\lim_{t \to \infty} v_{1,i}^{(t)} = v_{1,i},
\]
where $v_{1,i}^{(t)}$ is a vertex of $C_1^{(t)}$. Since for any $t$ and $i$,
\[
v_{1,i}^{(t)} = x_k^{(t)}
\]
for at least one $k$, there exists $j$, such that
\[
x_j^{(t)} = v_{1,i}^{(t)}
\]
for infinitely many $t$'s. Therefore, there exists an infinite time sequence
$t_n$'s, such that
\[
x_j^{(t_n)} = v_{1,i}^{(t_n)} \quad \forall n,
\]
which leads to
\[
\lim_{n \to \infty} x_j^{(t_n)} = v_{1,i}.
\]
If $x_j^{(t)} = v_{1,i}^{(t)}$ except for finitely many $t$, then equation (3)
is established. Otherwise, there exists $j' \ne j$ and another infinite time
sequence $s_n$'s, such that
\[
x_{j'}^{(s_n)} = v_{1,i}^{(s_n)} \quad \forall n.
\]
Without loss of generality, assume that $v_{1,i}^{(t)} = x_j^{(t)}$ or
$x_{j'}^{(t)}$ for all $t > 1$. From equation (1), if
$x_j^{(s)} = x_{j'}^{(s)}$ for some $s$, $x_j^{(t)} = x_{j'}^{(t)}$ for all
$t > s$. Therefore, for any $s > 0$, there exists $t > s$, such that
$v_{1,i}^{(t)} = x_j^{(t)}$ and $v_{1,i}^{(t+1)} = x_{j'}^{(t+1)}$. We
claim that this case, however, can never happen: when $t$ is large enough, it
is impossible that a data point inside the convex hull later becomes a new
vertex, since it is closer to other points than the current vertex is. In the
following we prove this claim only for the one dimensional case. For higher
dimensional cases, one can project all data points onto a proper line, and
then make the same argument.

Without loss of generality, assume $v_{1,i} = 0$, $x_j^{(t)} < 0$, and
$x_k^{(t)} > 0$ for $k \ne j$ or $j'$. If $x_{j'}^{(t+1)}$ later becomes the
new vertex, then

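\[
\frac{\sum_{k=1}^{N} f\bigl(x_{j'}^{(t)}, x_k^{(t)}\bigr)\, x_k^{(t)}}
     {\sum_{k=1}^{N} f\bigl(x_{j'}^{(t)}, x_k^{(t)}\bigr)}
<
\frac{\sum_{k=1}^{N} f\bigl(x_j^{(t)}, x_k^{(t)}\bigr)\, x_k^{(t)}}
     {\sum_{k=1}^{N} f\bigl(x_j^{(t)}, x_k^{(t)}\bigr)}.
\tag{5}
\]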



We claim that the inequality above is false. Since 
To prove that equation (5) is false, it suffices to show that



xtf+x^ 



N 



(*) jt) 



k 



fc=l 



(*) , (*) 

(i + /(4-4 t) ))-^4^+ E /(4-4 t) )-4 4) 

N 

EM^f) 
fe=i 

We claim the inequality above is true, and complete the proof of Lemma 2 based
on the following:

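(i) Since $x_j^{(t)}$ is the vertex,
$\|x_j^{(t)} - x_k^{(t)}\| \ge \|x_{j'}^{(t)} - x_k^{(t)}\|$ for all $k$, and hence
\[
f\bigl(x_j^{(t)}, x_k^{(t)}\bigr) \le f\bigl(x_{j'}^{(t)}, x_k^{(t)}\bigr).
\]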

(ii) Since a^* is the new vertex, it is is non-positive. Then 

xf+xf < 4* ) +/(^ ) ,xf)-4 t) 



2 i + Z^^f) 

N 

(*) „,(*h „(*) 



EW*n-*r 

< *S 

E/(4^4 t} ) 



fc=i 

(hi) Since ||xj- — t>i,i| < e, a^- > — e, and hence 3 2 3 > — e. e can be 

(*) i (*) 

chosen arbitrary small so that x k > — — 2 3 | for fc ^ j, j . 

(iv) The following lemma. 

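Lemma 4. Suppose $x_1 < 0$, $x_k > |x_1|$ for all $k = 2, \dots, n$,
$a_1 = b_1 > 0$, and $a_k \ge b_k > 0$ for all $k = 2, \dots, n$. If
\[
\frac{\sum_{k=1}^{n} a_k x_k}{\sum_{k=1}^{n} a_k} < 0,
\]
then
\[
\frac{\sum_{k=1}^{n} b_k x_k}{\sum_{k=1}^{n} b_k}
\le \frac{\sum_{k=1}^{n} a_k x_k}{\sum_{k=1}^{n} a_k}.
\]

Proof. Without loss of generality, assume that $x_1 = -1$; otherwise, we can
divide the $x_k$'s by $|x_1|$. Let
\[
c = \frac{\sum_{k=1}^{n} a_k x_k}{\sum_{k=1}^{n} a_k}.
\]
Since $c$ is the weighted average of the $x_k$'s and $x_k \ge -1$,
$-1 < c < 0$. Define
\[
c_k = \frac{a_k (x_k - c)}{c + 1} \quad \forall k.
\]
Note that $c_k > 0$ for $k \ge 2$, since $x_k > 0$ for $k \ge 2$. Then
\[
\sum_{k=1}^{n} c_k
= \sum_{k=1}^{n} \frac{a_k (x_k - c)}{c + 1}
= \frac{1}{c + 1} \Bigl( \sum_{k=1}^{n} a_k x_k - c \sum_{k=1}^{n} a_k \Bigr)
= \frac{1}{c + 1} \Bigl( \sum_{k=1}^{n} a_k x_k - \sum_{k=1}^{n} a_k x_k \Bigr)
= 0
\]
\[
\Rightarrow \sum_{k=2}^{n} c_k = -c_1 = a_1.
\]
Therefore,
\[
\frac{\sum_{k=1}^{n} a_k x_k}{\sum_{k=1}^{n} a_k}
= \frac{-a_1 + \sum_{k=2}^{n} a_k x_k}{a_1 + \sum_{k=2}^{n} a_k}
= \frac{-\sum_{k=2}^{n} c_k + \sum_{k=2}^{n} a_k x_k}
       {\sum_{k=2}^{n} c_k + \sum_{k=2}^{n} a_k}
= \frac{\sum_{k=2}^{n} (-c_k + a_k x_k)}{\sum_{k=2}^{n} (c_k + a_k)}.
\]
For each $k \ge 2$, it is obvious that
\[
\frac{-c_k + a_k x_k}{c_k + a_k} \ge \frac{-c_k + b_k x_k}{c_k + b_k},
\]
as $a_k \ge b_k$ means that the left-hand side puts more weight on $x_k$.
Furthermore,
\[
\frac{-c_k + a_k x_k}{c_k + a_k}
= \frac{-\frac{a_k (x_k - c)}{c + 1} + a_k x_k}{\frac{a_k (x_k - c)}{c + 1} + a_k}
= \frac{-a_k (x_k - c) + (c + 1) a_k x_k}{a_k (x_k - c) + (c + 1) a_k}
= \frac{c\, a_k (x_k + 1)}{a_k (x_k + 1)}
= c.
\]
Thus,
\[
\frac{\sum_{k=1}^{n} b_k x_k}{\sum_{k=1}^{n} b_k}
= \frac{\sum_{k=2}^{n} (-c_k + b_k x_k)}{\sum_{k=2}^{n} (c_k + b_k)}
\le c
= \frac{\sum_{k=1}^{n} a_k x_k}{\sum_{k=1}^{n} a_k}.
\]
□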
A.2 Proof of Lemma 3

Without loss of generality, assume that $x_i^{(t)}$ is the only data point that
converges to $v_{1,i}$.
\[
\frac{\sum_{j=1}^{N} f\bigl(x_i^{(t)}, x_j^{(t)}\bigr)\, x_j^{(t)}}
     {\sum_{j=1}^{N} f\bigl(x_i^{(t)}, x_j^{(t)}\bigr)} = x_i^{(t+1)}
\]
\[
\Rightarrow
\frac{\sum_{j=1}^{N} f\bigl(x_i^{(t)}, x_j^{(t)}\bigr) \bigl(x_j^{(t)} - x_i^{(t+1)}\bigr)}
     {\sum_{j=1}^{N} f\bigl(x_i^{(t)}, x_j^{(t)}\bigr)} = 0
\]
\[
\Rightarrow
\sum_{j \ne i} f\bigl(x_i^{(t)}, x_j^{(t)}\bigr) \bigl(x_j^{(t)} - x_i^{(t+1)}\bigr)
= x_i^{(t+1)} - x_i^{(t)}.
\tag{6}
\]
Since $x_i^{(t)}$ converges to $v_{1,i}$, $x_i^{(t)}$ and $x_i^{(t+1)}$ become
arbitrarily close to each other when $t$ is large enough. That is, the
right-hand side of (6) goes down to zero. On the other hand, since $x_j^{(t)}$
does not converge to $v_{1,i}$ for $j \ne i$, there is a gap between
$x_j^{(t)}$ and $x_i^{(t)}$. To force the left-hand side of (6) to be zero,
$f(x_i^{(t)}, x_j^{(t)})$ must go down to zero as well. This sketches the proof
for Lemma 3. The precise details are given in the following.

Because $x_j^{(t)}$ does not converge to $v_{1,i}$ for $j \ne i$, there exists
$\epsilon > 0$ such that, for any $t_0 > 0$, there exists $t > t_0$ with
$\|x_j^{(t)} - v_{1,i}\| > \epsilon$. In fact, $x_j^{(t)}$ cannot go
arbitrarily close to $v_{1,i}$ when $t$ is large enough, otherwise the updating
process will move $x_j^{(t)}$ and $x_i^{(t)}$ closer and closer to each other.
That is, there exist $\epsilon_1 > 0$ and $t_1$ such that
$\|x_j^{(t)} - v_{1,i}\| > \epsilon_1$ for all $t > t_1$. On the other hand,
because $x_i^{(t)} \to v_{1,i}$, for any $\epsilon_2 > 0$, there exists $t_2$,
such that $\|x_i^{(t)} - v_{1,i}\| < \epsilon_2$ for $t > t_2$.

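Since $v_{1,i}$ is a vertex of the convex set $C_1$, there exists $x \in C_1$,
such that the inner product of $\overrightarrow{v_{1,i} x}$ and
$\overrightarrow{v_{1,i} y}$ is positive for any $y \in C_1$. Let
\[
v_x = \frac{\overrightarrow{v_{1,i} x}}{\|\overrightarrow{v_{1,i} x}\|}.
\]
There exists $\alpha > 0$ and $t_3 > t_1$ such that
\[
\bigl\langle x_j^{(t)} - v_{1,i},\, v_x \bigr\rangle > \alpha \bigl\| x_j^{(t)} - v_{1,i} \bigr\|
\quad \forall t > t_3 \text{ and } \forall j \ne i,
\]
where $\langle \cdot, \cdot \rangle$ denotes the inner product. Taking the inner
product of both sides of (6) with $v_x$, we have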

N 
iV 

= E/^U?)-^? 3 -«i,*,«w) + 

iV 

N 

for t > ^3, and 
for t > Therefore, for t > ma,x(t^,t2), 

N 

Since $\epsilon_2$ can be arbitrarily small, the inequality above implies
\[
\sum_{j \ne i} f\bigl(x_i^{(t)}, x_j^{(t)}\bigr) \to 0.
\]
Since $f \ge 0$, $f(x_i^{(t)}, x_j^{(t)}) \to 0$ for all $j \ne i$.